Scalable Phrase Mining for Ad-hoc Text Analytics

نویسندگان

  • Srikanta Bedathur
  • Klaus Berberich
  • Jens Dittrich
  • Nikos Mamoulis
  • Gerhard Weikum
چکیده

Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer a great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize the analysis of interesting phrases. These include named entities, important quotations, market slogans, and other multi-word phrases that are prominent in a dynamically derived ad-hoc subset of the corpus, e.g., being frequent in the subset but relatively infrequent in the overall corpus. The ad-hoc subset may be derived by means of a keyword query against the corpus, or by focusing on a particular time period. We investigate alternative definitions of phrase interestingness, based on the probability of phrase occurrences. We develop preprocessing and indexing methods for phrases, paired with new search techniques for the top-k most interesting phrases on ad-hoc subsets of the corpus. Our framework is evaluated using a large-scale real-world corpus of New York Times news articles.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Interesting-Phrase Mining for Ad-Hoc Text Analytics

Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer a great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize ...

متن کامل

Pipelines for Ad-hoc Large-scale Text Mining

Pipelines for Ad-hoc Large-scale Text Mining Today’s web search and big data analytics applications aim to address information needs (typically given in the form of search queries) ad-hoc on large numbers of texts. In order to directly return relevant information instead of only returning potentially relevant texts, these applications have begun to employ text mining. The term text mining cover...

متن کامل

Design and evaluation of two scalable protocols for location management of mobile nodes in location based routing protocols in mobile Ad Hoc Networks

Heretofore several position-based routing protocols have been developed for mobile ad hoc networks. Many of these protocols assume that a location service is available which provides location information on the nodes in the network.Our solutions decrease location update without loss of query success rate or throughput and even increase those.Simulation results show that our methods are effectiv...

متن کامل

Design and evaluation of two scalable protocols for location management of mobile nodes in location based routing protocols in mobile Ad Hoc Networks

Heretofore several position-based routing protocols have been developed for mobile ad hoc networks. Many of these protocols assume that a location service is available which provides location information on the nodes in the network.Our solutions decrease location update without loss of query success rate or throughput and even increase those.Simulation results show that our methods are effectiv...

متن کامل

Notes on Phrasal Indexing: JSCB Evaluation Experiments at NTCIR AD HOC

The evaluation experiments of the JSCB team are described with a focus on noun phrase indexing and its weighting issues in ad hoc text retrieval. Experiments on the effects of supplemental noun phrase indexing in view of the effect of various length of queries are reported. The results show that the noun phrase indexing outperforms single word only indexing with long queries while single word o...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009